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Abstract —In this paper, we propose an online algorithm to 
compute matrix factorizations. Proposed algorithm updates the 
dictionary matrix and associated coefficients using a single 
observation at each time. The algorithm performs low-rank 
updates to dictionary matrix. We derive the algorithm by defining 
a simple objective function to minimize whenever an observation 
is arrived. We extend the algorithm further for handling missing 
data. We also provide a mini-batch extension which enables to 
compute the matrix factorization on big datasets. We demonstrate 
the efficiency of our algorithm on a real dataset and give com¬ 
parisons with well-known algorithms such as stochastic gradient 
matrix factorization and nonnegative matrix factorization (NMF). 

Index Terms —Matrix factorizations, Online algorithms. 

I. Introduction 

ROBLEM of online factorization of data matrices is of 
special interest in many domains of signal processing and 
machine learning. The interest comes from either applications 
with streaming data or from domains where data matrices are 
too wide to use batch algorithms. Analysis of such datasets 
is needed in many popular application domains in signal 
processing where batch matrix factorizations is successfully 
applied Uj. Some of these applications include processing and 
restoration of images ja, source separation or denoising of 
musical El, El and speech signals 0, and predicting user 
behaviour from the user ratings (collaborative filtering) 0. 
Nowadays, since most applications in these domains require 
handling streams or large databases, there is a need for online 
factorization algorithms which updates factors only using a 
subset of observations. 

Formally, matrix factorization is the problem of factorizing 
a data matrix Y G into 0, 

Y ^CX (1) 

where C € and X G Intuitively, r is the 

approximation rank which is typically selected by hand. These 
methods can be interpreted as dictionary learning where 
columns of C defines the elements of the dictionary, and 
columns of X can be thought as associated coefficients. 
Online matrix factorization problem consists of updating C 
and associated columns of X by only observing a subset of 
columns of Y which is the problem we are interested in this 
work. 

In recent years, many algorithms were proposed to tackle 
online factorization problem. In ||7], authors propose an al¬ 
gorithm which couples the expectation-maximization with 
sequential Monte Carlo methods for a specific Poisson non¬ 
negative matrix factorization (NMF) model to develop an 
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online factorization algorithm. The model makes Markovian 
assumptions on the columns of X, and it is similar to the 
classical probabilistic interpretation of NMF 0, but a dynamic 
one. They demonstrate the algorithm on synthetic datasets. 
In 0 , authors propose an online algorithm to solve the 
Itakura-Saito NMF problem where only one column of data 
matrix is used in each update. They also provide a mini-batch 
extension to apply it in a more efficient manner and demon¬ 
strate audio applications. In m, authors propose several 
algorithms for online matrix factorization using sparsity priors. 
In HD, authors propose an incremental nonnegative matrix 
factorisation algorithm based on an incremental approximation 
of the overall cost function with video processing applications. 
In HU, authors implement a stochastic gradient algorithm 
for matrix factorization which can be used in many different 
settings. 

In this paper, we propose an online algorithm to compute 
matrix factorizations, namely the online matrix factorization 
via Broyden updates (OMF-B). We do not impose nonnega¬ 
tivity conditions although they can be imposed in several ways. 
At each time, we assume only observing a column of the data 
matrix (or a mini-batch), and perform low-rank updates to 
dictionary matrix C. We do not assume any structure between 
columns of X, and hence Y, but it is possible to extend our 
algorithm to include a temporal structure. OMF-B is very 
straightforward to implement and has a single parameter to 
tune aside from the approximation rank. 

The rest of the paper is organised as follows. In Section 
m we introduce our cost function and the motivation behind 
it explicitly. In Section |III] we derive our algorithm and 
give update rules for factors. In Section HV] we provide two 
modifications to implement mini-batch extension and update 
rules for handling missing data. In Section |V] we compare 
our algorithm with stochastic gradient matrix factorization and 
NMF on a real dataset. Section |Vl] concludes this paper. 

II. The Objective Function 

We would like to solve the approximation problem ([T]i by 
using only columns of Y at each iteration. For notational 
convenience, we denote the fc’th column of Y with G R™. 
In the same manner, we denote the fc’th column of X as 
Xk G where r is the approximation rank. This notation 
is especially suitable for matrix factorisations when columns 
of the data matrix represent different instances (e.g. images). 
We set [n] = {1,..., n} for n G N. 

We assume that we observe random columns of Y. To 
develop an appropriate notation, we use yk^ to denote the 
data vector observed at time t where kt is sampled from \n] 
uniformly random. The use of this notation implies that, at 



2 


time t, we can sample any of the columns of Y denoted by yk^ ■ 
This randomization is not required and one can use sequential 
observations as well by putting simply kt = t. We denote the 
estimates of the dictionary matrix C at time f as Ct- As stated 
before, we would like to update dictionary matrix C and a 
column of the X matrix Xkt after observing a single column 
yk-t of the dataset Y. For this purpose, we make the following 
crucial observations: 

(i) We need to ensure yk^ ~ CtXkt at time t for kt € [n], 

(ii) We need to penalize Ct estimates in such a way that 
it should be “common to all observations”, rather than 
being overfitted to each observation. 

As a result we need to design a cost function that satisfies 
conditions (i) and (ii) simultaneously. Therefore, for fixed t, 
we define the following objective function which consists of 
two terms. Suppose we are given yk^. for kt G \n] and Ct-i, 
then we solve the following optimization problem for each t, 

= argmin||j/fc, - CtXkt\\\ + X\\Ct - Ct-i\\], (2) 

xkt,Ct 

where A S K. is a parameter which simply chooses how much 
emphasis should be put on specific terms in the cost function. 
Note that, Eq. (|2|i has an analytical solution both in Xkt and 
Ct separately. The first term ensures the condition (i), that is, 
ykt ~ CtXkt ■ The second term ensures the condition (ii) which 
keeps the estimate of dictionary matrix C “common” to all 
observations. Intuitively, the second term penalizes the change 
of entries of Ct matrices. In other words, we want to restrict Ct 
in such a way that it is still close to Ct-i after observing ykt 
but also the error of the approximation ykt ~ CtXkt is small 
enough. One can use a weighted Frobenius norm to define a 
correlated prior structure on Ct ifTSl . but this is left as a future 
work. 


III. Online Factorization Algorithm 

For each t, we solve dU by fixing Xkt and Ct- In other 
words, we will perform an alternating optimisation scheme at 
each step. In the following subsections, we derive the update 
rules explicitly. 


A. Derivation of the update rule for Xkt 

To derive an update for Xkt, Ct is assumed to be fixed. To 
solve for Xkt, let Gkt denote the cost function such that, 

Gkt = \\ykt-CtXkt\\^p + \\\Ct-Ct-i\^p, 

and set Gkt = 0- We are only interested in the first term. 
As a result, solving for Xkt becomes a least squares problem, 
the solution is the following pseudoinverse operation [12], 


Xkt = {GjCt)-^Cj 


Vkt 


(3) 


Algorithm 1 OMF-B 
1: Initialise Cq randomly and set f = 1. 

2: repeat 

3: Pick kt G [n] at random. 

4: Read ykt S K™ 

5: for Iter = 1 : 2 do 

Xkt = {GjGt)-^Gjykt 
n -Gt-ixkt)xi^ 

X + xl^Xkt 

6 : end for 

7: t i — f -f 1 

8: until convergence 


B. Derivation of the update rule for Ct 

If we assume Xkt is fixed, the update with respect to Ct can 
be derived by setting VctGkt = 0- We leave the derivation to 
the appendix and give the update as. 


Ct = {\Ct-i + yktxlt){XI + xktxl^) S (4) 

and by using Sherman-Morrison formula Cl for the term 
(A/ + XktX^^)~^, Eq. (Hi can be written more explicitly as. 


Ct = Ct-i + 


(jjkt -Ct-iXkt)xl^ 
X + xJxkt 


(5) 


which is same as the Broyden’s rule of quasi-Newton methods 
as A —y 0 d- We need to do some subiterations between 
updates (H and (|5]) for each t. As it turns out, empirically, 
even 2 inner iterations are enough to obtain a reliable overall 
approximation error. 


IV. Some Modifications 

In this section, we provide two modifications of the Al¬ 
gorithm [T] The first modification is an extension to a mini¬ 
batch setting and requires no further derivation. The second 
modification provides the rules for handling missing data. 


A. Mini-Batch Setting 

In this subsection, we describe an extension of the Algo¬ 
rithm [T] to the mini-batch setting. If n is too large (e.g. hun¬ 
dreds of millions), it is crucial to use subsets of the datasets. 
We use a similar notation, where instead of kt, now we use an 
index set vt C [nj. We denote a mini-batch dataset at time t 
with y^t- Hence G where |i;t| is the cardinality of 

the index set vt- In the same manner, Xyt G denotes 

the corresponding columns of the X. We can not use the 
update rule (|5]) immediately by replacing ykt with (and Xkt 
with Xvt) because now we can not use the Sherman-Morrison 
formula for ®. Instead we have to use Woodbury matrix 
identity ifll . However, we just give the general version @ and 
leave the use of this identity as a choice of implementation. 
Under these conditions, the following updates can be used for 
mini-batch OMF-B algorithm. Update for x^t reads as, 

Xyt = {CjCt)-^Cjyyt 


for fixed Ct- 


( 6 ) 
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Algorithm 2 OMF-B with Missing Data 
1: Initialise Cq randomly and set t = 1. 

2: repeat 

3: Pick kt G [n] at random. 

4: Read yk^ G M"* 

5: for Iter = 1 : 2 do 

Xkt = {{Mct © Ct)^{Mct 0 Ct)) X 

{Me, QCt)^{mk, Qykt) 

„ „ , {mk, Q {Vk, - Ct-iXk,))xJ^ 

X + xl^Xk, 

6 : end for 

7: t i — t -{- 1 

8: until convergence 


and update rule for Ct can be given as, 

Ct = {XCt-i + yvtxJu^iXI Xy,x{l,^) ^ (7) 

which is no longer same as the Broyden’s rule for mini-batch 
observations. 


As a result the update rule for Xk, becomes the following 
pseudoinverse operation, 

Xk, ={{Mc, 0 CtViMc, © Ct))-^x 

{Me, QCt)^{mk, Qyk,), 

and the update rule for Ct (for fixed can trivially be given 
as, 

^ ^ , {mk, & {yk, - Ct-iXk,))xJ^ 

X + xl^Xk, 

We denote the results on dataset with missing entries in 
Experiment IV-BI 

V. Experimental Results 

In this section, we demonstrate two experiments on the 
Olivetti faces dataseQ consists of 400 faces with size of 
64 X 64 grayscale pixels. We hrst compare our algorithm with 
stochastic gradient descent matrix factorization in the sense 
of error vs. runtimes. In the second experiment, we randomly 
throw away the %25 of each face in the dataset, and try to fill- 
in the missing data. We also compare our results with NME 

m. 


B. Handling Missing Data 

In this subsection, we give the update rules which can han¬ 
dle the missing data. We only give the updates for single data 
vector observations because deriving the mini-batch update for 
missing data is not obvious and also become computationally 
demanding as \vt\ increases. So we only consider the case 
|ut| = 1 i.e. we assume observing only a single-column at a 
time. 

We define a mask M G {0,1}™^", and we denote the data 
matrix with missing entries with MQY where © denotes the 
Hadamard product. We need another mask to update related 
entries of the estimate of the dictionary matrix Ct, which is 
denoted as Me, and naturally. Me, G {0, l}™xr Suppose 
we have an observation yk, at time t and some entries of the 
observation are missing. We denote the mask vector for this 
observation as ruk, which is fct’th column of M. We construct 
Me, for each t in the following way: 

Me, = [mk„...,mk,]. 

r times 

The use of Me, stems from the following fact. We would like 
to solve the following least squares problem for Xk, (for hxed 

Ct), 

II 112 

min mk, Q {yk, - CtXk,) \\„. (8) 

Xk, 

One can easily verify that, 

mk, © {CtXk,) = {Me, © Ct) Xk,. 

Then ® can equivalently be written as, 

II 11 2 

min {nik, © yk,) - {Me, © Ct) Xk, L. 

Xk, ^ 


A. Comparison with stochastic gradient MF 

In this section, we compare our algorithm with the stochas¬ 
tic gradient descent matrix factorization (SOME) algorithm 
na. Notice that one can write the classical matrix factoriza¬ 
tion cost as, 

n 

\\Y-wH\\l^J2\\y>^-^^4l 

k=l 

SO it is possible to apply alternating stochastic gradient algo¬ 
rithm na. We derive and implement the following updates 
for SOME, 

Wt=Wt.i- ir^wWvk, - Whk,\\l 

" W=Wt-i 

^k, = - It'^hllvk - Wth\\l _ 

for uniformly random kt G [n] for each t. The follow¬ 
ing conditions hold for convergence: lY — 

TZ, bn < oo and same conditions hold for 'y^. In 
practice we scale the step-sizes like a/t^ where 0 < a < oo 
and 0.5 < /3 < 1. These are other parameters we have to 
choose for both W and h. It is straightforward to extend this 
algorithm to mini-batches ifT^ . We merely replace kt with vt- 

In this experiment, we set identical conditions for both 
algorithms, where we use the Olivetti faces dataset, set r = 30, 
and use mini-batch size 10 for both algorithms. We have 
carefully tuned and investigated step-size of the SOME to 
obtain the best performance. We used scalar step sizes for the 
matrix W and we set a step-size for each mini-batch-index, 
i.e. we use a matrix step-size for updating hy,. We set A = 10. 
At the end, both algorithms passed 30 times over the whole 

'Available at: http://www.cs.nyu.edu/~roweis/data.html 
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(a) 



Fig. 1. Comparison with SGMF on Olivetti faces dataset, (a) This plot 
shows that although SGMF is faster than our algorithm (since we employ 
two iterations for each mini-batch), and SGMF processes the dataset in a 
much less wall-clock time, we achieve a lower error in the same wall-clock 
time, (b) This plot shows that our algorithm uses samples in a more efficient 
manner. We obtain lower errors for the same processed amount of data. 


dataset taking mini-batch samples at each time. We measure 
the error by taking Frobenius norm of the difference between 
real data and the approximation. 

The results are given in Fig. [T] We compared error vs. 
runtimes and observed that SGMF is faster than our algorithm 
in the sense that it completes all passes much faster than OMF- 
B as can be seen from Fig. [11 a). However our algorithm uses 
data much more efficiently and achieves much lower error rate 
at the same runtime by using much fewer data points than 
SGMF. In the long run, our algorithm achieves a lower error 
rate within a reasonable runtime. Additionally, our algorithm 
has a single parameter to tune to obtain different error rates. 
In contrast, we had to carefully tune the SGMF step-sizes 
and even decay rates of step-sizes. Compared to SGMF, our 
algorithm is much easier to implement and use in applications. 


B. Handling missing data on Olivetti dataset 

In this experiment, we show results on the Olivetti faces 
dataset with missing values where %25 of the dataset is 
missing (we randomly throw away %25 of the faces). Although 
this dataset is small enough to use a standard batch matrix 
factorisation technique such as NMF, we demonstrate that our 
algorithm competes with NMF in the sense of Signal-to-Noise 
Ratio (SNR). We compare our algorithm with NMF in terms 
of number of passes over data vs. SNR. We choose A = 2, and 
set inner iterations as 2. Our algorithm achieves approximately 
same SNR values with NMF (1000 batch passes over data) 
with only 30 online passes over dataset. This shows that our 
algorithm needs much less low-cost passes over dataset to 
obtain comparable results with NMF. Numbers and visual 
results are given in Fig. [2] 



Fig. 2. A demonstration on Olivetti faces dataset consists of 400 faces of size 
64 X 64 with %25 missing data. Some example faces with missing data are 
on the left. Compaiison of results of OMF-B (middle) with 30 online passes 
over dataset and NMF with 1000 batch iterations (right). Signal-to-noise ratios 
(SNR) are: OMF-B: 11.57, NMF: 12.13 where initial SNR is 0.75. 


VI. Conclusions and Future Work 

We proposed an online and easy-to-implement algorithm 
to compute matrix factorizations, and demonstrated results 
on the Olivetti faces dataset. We showed that our algorithm 
competes with the state-of-the-art algorithms in different con¬ 
texts. Although we demonstrated our algorithm in a general 
setup by taking random subsets of the data, it can be used 
in a sequential manner as well, and it is well suited to 
streaming data applications. In the future work, we plan to 
develop probabilistic extensions of our algorithm using recent 
probabilistic interpretations of quasi-Newton algorithms, see 
e.g. m and ESI. The powerful aspect of our algorithm is 
that it can also be used with many different priors on columns 
of X such as the one proposed in lITbl . As a future work, we 
think to elaborate more complicated problem formulations for 
different applications. 
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Appendix 

We derive Vct^kt the following. First we will find 

II 11 2 

^Ct\\ykt ~ CtXktW^ which is the derivative of the first term. 
Notice that 

\\ykt - CtXk, II 2 = Tr - ‘^yJCtXk, + xJ^CjCtXk,) 

First of all the first term is not important for us, since it does 
not include Ct- Using standard formulas for derivatives of 
traces iflTl . we arrive, 

Vctilyfet ~ C'(a;fcj||2 = ~‘^ykt^kt 
The second term of the cost function can be written as, 

A||a - Ct-i\\l = ATr {{Ct - Ct-iV{Ct - Ct-i)) 

If we take the derivative with respect to Ct using properties 
of traces m, 

VcA\\Ct - Ct-i\\^p = 2XCt - 2XCt-i (10) 

By summing (|9]l and (ITOl l. setting them equal to zero, and 
leaving Ct alone, one can show (jUi easily. Using Sherman- 
Morrison formula, one can obtain the update rule given in the 
Eq. ©. 
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